[Figure: flow chart excerpt. Step 602: look up service for packet based on destination address of packet.]



(54) Method and apparatus for performing a fast service lookup in cluster networking 



(57) One embodiment of the present invention pro- 
vides a system that uses a destination address of a 
packet to perform a fast lookup to determine a service 
that is specified by the destination address. The system 
initially receives (601) a packet at an interface node in 
the cluster of nodes. This packet includes a source ad- 
dress specifying a location of a client that the packet 
originated from, and the destination address specifying 
a service provided by the cluster of nodes. The system 
uses the destination address to perform a first lookup 
(602) into a first lookup structure containing identifiers 
for scalable services. Note that a scalable service is a 
service that provides more server node capacity for the 
scalable service as demand for the scalable service in- 
creases. If no identifier for a scalable service is returned 
during the first lookup, the system sends the packet to 
a server node in the cluster of nodes that provides a non- 
scalable service. 
Description
Related Applications 

[0001] The subject matter of this patent application is 
related to the subject matter in the following co-pending 
non-provisional patent applications filed on the same 
day as the instant application: (1) "Method and Appara- 
tus for Providing Scalable Services Using a Packet Dis- 
tribution Table," by inventors Sohrab F. Modi, Sankar 
Ramamoorthi, Mahalingam Mani, Brian M. Oki, Kevin 
C. Fox and Hariprasad B. Mankude, serial number TO
BE ASSIGNED, filing date TO BE ASSIGNED (Attorney 
Docket No. SUN-P438B-ARG); (2) "Method and Apparatus
for Fast Packet Forwarding in Cluster Networking,"
by inventors Hariprasad B. Mankude and Sohrab F.
Modi, serial number TO BE ASSIGNED, filing date TO 
BE ASSIGNED (Attorney Docket No. SUN- 
P4416-ARG); (3) "Network Client Affinity For Scalable 
Services," by inventors Sohrab F. Modi, Sankar Rama- 
moorthi, Kevin C. Fox, and Tom Lin, serial number TO 
BE ASSIGNED, filing date TO BE ASSIGNED (Attorney 
Docket No. SUN1P384); and (4) "Method For Creating 
Forwarding Lists For Cluster Networking," by inventors 
Hariprasad B. Mankude, Sohrab F. Modi, Sankar Ramamoorthi,
Mani Mahalingam and Kevin C. Fox, serial
number TO BE ASSIGNED, filing date TO BE AS- 
SIGNED (Attorney Docket No. SUN1 P385). 

BACKGROUND 

Field of the Invention 

[0002] The present invention relates to clustered 
computer systems with multiple nodes that provide serv- 
ices in a scalable manner. More specifically, the present 
invention relates to a method and an apparatus that us- 
es a destination address to perform a fast lookup to de- 
termine a service for a packet. 

Related Art 

[0003] The recent explosive growth of electronic com- 
merce has led to a proliferation of web sites on the In- 
ternet selling products as diverse as toys, books and au- 
tomobiles, and providing services, such as insurance 
and stock trading. Millions of consumers are presently 
surfing through web sites in order to gather information, 
to make purchases, or purely for entertainment. 
[0004] The increasing traffic on the Internet often 
places a tremendous load on the servers that host web 
sites. Some popular web sites receive over a million
"hits" per day. In order to process this much traffic with- 
out subjecting web surfers to annoying delays in retriev- 
ing web pages, it is necessary to distribute the traffic 
between multiple server nodes, so that the multiple server
nodes can operate in parallel to process the traffic.
[0005] In designing such a system to distribute traffic
between multiple server nodes, a number of characteristics
are desirable. It is desirable for such a system to
be efficient in order to accommodate as much traffic as
possible with a minimal amount of response time. It is
desirable for such a system to be "scalable," so that additional
server nodes can be added and distribution to the
nodes can be modified to provide a service as demand
for the service increases. In doing so, it is important to
ensure that response time does not increase as additional
server nodes are added. It is also desirable for
such a system to be constantly available, even when
individual server nodes or communication pathways between
server nodes fail.

[0006] A system that distributes traffic between multiple
server nodes typically performs a number of tasks.
Upon receiving a packet, the system looks up a service
that the packet is directed to. (Note that a collection of
server nodes will often host a number of different servers.)
What is needed is a method and an apparatus for
performing a service lookup that is efficient, scalable
and highly available.

[0007] Once the service is determined, the system
distributes workload involved in providing the service
between the server nodes that are able to provide the
service. For efficiency reasons, it is important to ensure
that packets originating from the same client are directed
to the same server. What is needed is a method and
an apparatus for distributing workload between server
nodes that is efficient, scalable and highly available.

[0008] Once a server node is selected for the packet,
the packet is forwarded to the server node. The conventional
technique of using a remote procedure call (RPC)
or an interface definition language (IDL) call to forward
a packet typically involves traversing an Internet Protocol
(IP) stack from an RPC/IDL endpoint to a transport
driver at the sender side, and then traversing another IP
stack on the receiver side, from a transport driver to an
RPC/IDL endpoint. Note that traversing these two IP
stacks is highly inefficient. What is needed is a method
and an apparatus for forwarding packets to server
nodes that is efficient, scalable and highly available.

SUMMARY 

[0009] One embodiment of the present invention provides
a system that uses a destination address of a
packet to perform a fast lookup to determine a service
that is specified by the destination address. The system
initially receives a packet at an interface node in the
cluster of nodes. This packet includes a source address
specifying a location of a client that the packet originated
from, and the destination address specifying a service
provided by the cluster of nodes. The system uses the
destination address to perform a first lookup into a first
lookup structure containing identifiers for scalable services.
Note that a scalable service is a service that provides
more server node capacity for the scalable service
as demand for the scalable service increases. If no identifier
for a scalable service is returned during the first
lookup, the system sends the packet to a server node
in the cluster of nodes that provides a non-scalable service.

[0010] In one embodiment of the present invention, if
an identifier for a scalable service is returned for the 
packet, the system looks up a server node to send the 
packet to, based upon the source address of the packet 
(and possibly the destination address of the packet) and 
sends the packet to the server node. 
[0011] In one embodiment of the present invention,
the system looks up the server node by performing a 
function that maps the source address to an entry in a 
packet distribution table (PDT), which includes entries 
containing identifiers for server nodes. In a variation on 
this embodiment, the function is a hash function that 
maps different source addresses to different entries in 
the packet distribution table in a substantially random 
manner, so that a given source address always maps to 
the same entry in the packet distribution table. 
[0012] In one embodiment of the present invention, 
the system allows the server node to send return com- 
munications directly to the client without forwarding the 
return communications through the interface node. 
[0013] In one embodiment of the present invention, 
the first lookup structure is a hash table containing the 
identifiers for the scalable services. 
[0014] In one embodiment of the present invention, if 
the first lookup does not return an identifier for a scalable 
service, the system uses the destination address to per- 
form a second lookup into a second lookup structure 
containing identifiers for scalable services. In a variation
on this embodiment, the first lookup is based upon an 
Internet Protocol (IP) address and an associated port 
number, and the second lookup is based upon the IP 
address without the associated port number. 
[0015] In one embodiment of the present invention, 
the first lookup structure includes identifiers for scalable
services that use a first load balancing policy to distrib- 
ute packets between server nodes, and the second 
lookup structure includes identifiers for scalable servic- 
es that use a second load balancing policy. In a variation 
on this embodiment, the second load balancing policy 
locates related services for a given client on the same 
server node. 

[0016] In one embodiment of the present invention, if 
no scalable service is returned for the packet, the sys- 
tem allows a server instance on the interface node to 
provide the service. 

[0017] In one embodiment of the present invention, 
the system periodically sends checkpointing information 
from a PDT server node to a secondary PDT server 
node so that the secondary PDT server node is kept in 
a consistent state with the PDT server node. This allows 
the secondary PDT server node to take over for the PDT 
server node if the PDT server node fails. 
[0018] In one embodiment of the present invention, 
the system periodically sends checkpointing information
from a master PDT server node to at least one slave
PDT server node so that the slave PDT servers are kept 
in a consistent state with the master PDT server. 
[0019] In one embodiment of the present invention, 
the destination address includes an Internet Protocol
(IP) address, an associated port number for the service 
and a protocol identifier (such as transmission control 
protocol (TCP) or user datagram protocol (UDP)). 



[0020] FIG. 1 illustrates a clustered computing system 
coupled to client computing systems through a network 
in accordance with an embodiment of the present
invention.

[0021] FIG. 2 illustrates the internal structure of an in- 
terface node and two server nodes within a clustered 
computing system in accordance with an embodiment 
of the present invention. 
[0022] FIG. 3 illustrates data structures associated
with a scalable service in accordance with an embodiment
of the present invention.

[0023] FIG. 4 illustrates how an IP packet is encapsulated
with a DLPI header in accordance with an embodiment
of the present invention.

[0024] FIG. 5A is a flow chart illustrating the process 
of service registration in accordance with an embodi- 
ment of the present invention. 
[0025] FIG. 5B is a flow chart illustrating the process
of service activation/deactivation in accordance with an
embodiment of the present invention.
[0026] FIG. 6 is a flow chart illustrating how a packet 
is processed within an interface node in accordance with 
an embodiment of the present invention. 

[0027] FIG. 7 is a flow chart illustrating the process of
looking up a service for a packet in accordance with an 
embodiment of the present invention. 
[0028] FIG. 8 is a flow chart illustrating the process of
forwarding a packet to a server in accordance with an
embodiment of the present invention.

[0029] FIG. 9 illustrates how a PDT server is check- 
pointed to a slave PDT server and a secondary PDT 
server in accordance with an embodiment of the present 
invention. 


DETAILED DESCRIPTION 

[0030] The following description is presented to enable
any person skilled in the art to make and use the
invention, and is provided in the context of a particular
application and its requirements. Various modifications
to the disclosed embodiments will be readily apparent
to those skilled in the art, and the general principles defined
herein may be applied to other embodiments and
applications without departing from the spirit and scope
of the present invention. Thus, the present invention is
not intended to be limited to the embodiments shown,
but is to be accorded the widest scope consistent with
the principles and features disclosed herein.
[0031] The data structures and code described in this
detailed description are typically stored on a computer 
readable storage medium, which may be any device or 
medium that can store code and/or data for use by a 
computer system. This includes, but is not limited to, 
magnetic and optical storage devices such as disk 
drives, magnetic tape, CDs (compact discs) and DVDs 
(digital video discs), and computer instruction signals 
embodied in a transmission medium (with or without a 
carrier wave upon which the signals are modulated). For 
example, the transmission medium may include a com- 
munications network, such as the Internet. 

Clustered Computing System 

[0032] FIG. 1 illustrates a clustered computing system 
100 coupled to clients 121-123 through networks 120 in
accordance with an embodiment of the present inven- 
tion. Clients 121-123 can include any node on networks 
120, including computational capability and including a 
mechanism for communicating across networks 120. 
Clients 121-123 communicate with clustered computing
system 100 by sending packets to clustered computing 
system 100 in order to request services from clustered 
computing system 100. 

[0033] Networks 120 can include any type of wired or
wireless communication channel capable of coupling to- 
gether computing nodes. This includes, but is not limited 
to, a local area network, a wide area network, or a com- 
bination of networks. In one embodiment of the present 
invention, networks 120 include the Internet.
[0034] Clustered computing system 100 includes a 
set of nodes that are coupled together through a com- 
munication channel (not shown). These nodes include 
server nodes 102 and 104 as well as interface node/server
node 103. Nodes 102-104 are coupled to storage
system 110. Storage system 110 provides archival storage
for code and/or data that is manipulated by nodes
102-104. This archival storage may include, but is not
limited to, magnetic storage, flash memory, ROM,
EPROM, EEPROM, and battery-backed-up RAM.
[0035] Nodes 102-104 are coupled together through
a private interconnect with redundant pathways (not 
shown). For example, nodes 102-104 can be intercon- 
nected through a communication mechanism adhering 
to the Ethernet or a scalable coherent interconnect (SCI) 
standards. A path manager operates on all of the nodes 
in clustered computing system 100. This path manager
knows about the interconnect topology and monitors the 
status of pathways. The path manager also provides an 
interface registry to which other components interested 
in the status of the interconnect can register. This pro- 
vides a mechanism for the path manager to make call- 
backs to the interested components when the status of 
a path changes, if a new path comes up, or if a path is 
removed. 
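
The registration-and-callback behavior described above can be sketched as follows. This is a minimal illustration only: the class name `PathManager`, its method names, and the event strings (`path_added`, `status_changed`, `path_removed`) are hypothetical and do not come from the specification, which states only that interested components register and are called back when a path's status changes, a new path comes up, or a path is removed.

```python
from typing import Callable, Dict, List

# Callback signature (assumed): receives a path identifier and an event name.
PathCallback = Callable[[str, str], None]


class PathManager:
    """Hypothetical path manager with a registry of interested components."""

    def __init__(self) -> None:
        self._status: Dict[str, str] = {}          # path id -> "up" / "down"
        self._callbacks: List[PathCallback] = []   # registered components

    def register(self, callback: PathCallback) -> None:
        """Register a component interested in interconnect status."""
        self._callbacks.append(callback)

    def _notify(self, path_id: str, event: str) -> None:
        for callback in self._callbacks:
            callback(path_id, event)

    def set_status(self, path_id: str, status: str) -> None:
        """Record a path status; call back only on a new path or a change."""
        if path_id not in self._status:
            self._status[path_id] = status
            self._notify(path_id, "path_added")
        elif self._status[path_id] != status:
            self._status[path_id] = status
            self._notify(path_id, "status_changed")

    def remove_path(self, path_id: str) -> None:
        """Forget a path and call back if it was known."""
        if self._status.pop(path_id, None) is not None:
            self._notify(path_id, "path_removed")
```

A component would call `register` once with its handler and then receive one call per status transition; repeated reports of an unchanged status produce no callback in this sketch.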

[0036] Nodes 102-104 are coupled to networks 120
through a highly available addressing system 108. Highly
available addressing system 108 allows interface
nodes within clustered computing system 100 to be addressed
from networks 120 in a "highly-available" manner
so that if an interface node fails, a backup secondary
interface node is able to take its place without the failure
being visible to clients 121-123. Note that interface node
103 can host one or more shared IP addresses for clustered
computing system 100. Also note that more than
one node in clustered computing system 100 can act as
an interface node for a given service. This allows a backup
interface node to take over for an interface node that
fails.

[0037] Note that nodes 102-104 within clustered computing
system 100 can provide scalable services. Each
scalable service behaves as a single logical entity from
the view of clients 121-123. Also note that clients
121-123 can communicate with clustered computing
system 100 through a transmission control protocol
(TCP) connection or a user datagram protocol (UDP)
session.

[0038] As load on a service increases, the service attempts
to maintain the same per-client response time.
A service is said to be "scalable" if increased load on
the service is matched with an increase in hardware and
server instances that are performing the service. For example,
a web server is scalable if additional load on the
web server is matched by a corresponding increase in
server nodes to process the additional load, or by a
change in the distribution of the load across the hardware
and server instances that are performing the service.

[0039] Clustered computing system 100 operates
generally as follows. As packets arrive at interface node
103 from clients 121-123, a service is selected for the
packet based on the destination address in the packet.
Next, a server instance is selected for the packet based
upon the source address of the packet as well as the
destination address of the packet. Note that the system
ensures that packets belonging to the same TCP connection
or UDP instance are sent to the same server
instance. Finally, the packet is sent to the selected server
instance.

Internal Structure of Interface Nodes and Server Nodes

[0040] FIG. 2 illustrates the internal structure of interface
node 103 and server nodes 102 and 104 within
clustered computing system 100 in accordance with an
embodiment of the present invention. Client 121 sends
packets to clustered computing system 100 in order to
receive a service from clustered computing system 100.
These packets enter public interface 221 within interface
node 103 in clustered computing system 100. Public
interface 221 can include any type of interface that is
able to receive packets from networks 120.
[0041] As packets arrive at interface node 103 via
public interface 221, they pass through cluster networking
multiplexer 218. Cluster networking multiplexer 218
forwards the packets to various nodes within clustered
computing system 100 based upon load balancing policies
and other considerations. In making forwarding decisions,
cluster networking multiplexer 218 retrieves data
from highly available (HA) PDT server 230. The structure
of this data is described in more detail below with reference
to FIG. 3. Note that HA PDT server 230 may be
replicated across multiple nodes of clustered computing
system 100 so that in case a node fails, a backup node
can take over for it to maintain availability for HA PDT
server 230.

[0042] Packets are forwarded from interface node
103 to other nodes in clustered computing system 100,
including server nodes 102 and 104, through private
interfaces 224 and 225. Private interfaces 224 and 225
can include any interface that can handle communications
between nodes within clustered computing system
100. For example, packets can be forwarded from private
interface 224 to private interface 226 on server
node 104, or from private interface 225 to private interface
228 on server node 102. Note that private interfaces
224 and 225 do not handle communications with entities
outside of clustered computing system 100.
[0043] In some embodiments of the present invention, 
private interface 224 (and 225) and public interface 221 
share some of the same communication hardware and 
send messages down some of the same physical data 
paths. In some of these embodiments, private interface 
224 and public interface 221 may also share some of 
the same interface software. Hence, private interface 
224 and public interface 221 need not represent differ- 
ent communication mechanisms. Therefore, the distinc- 
tion between private interface 224 and public interface 
221 can be merely a distinction between whether the 
communications are with an entity outside of clustered 
computing system 100, or with an entity within clustered 
computing system 100. 

[0044] Packets entering server nodes 102 and 104
pass through IP stacks 214 and 216, respectively. Cluster
networking multiplexer 218 can also send packets to
IP stack 215 within interface node/server node 103, because
node 103 is also able to act as a server. On server
node 102, packets pass through IP stack 214 into TCP
module 206, which supports TCP connections, or into
UDP module 210, which supports UDP sessions. Similarly,
on interface node/server node 103, packets pass
through IP stack 215 into TCP module 207, or into UDP
module 211. On server node 104, packets pass through
IP stack 216 into TCP module 208, or into UDP module
212. Next, the packets are processed by server instances
201-203 on nodes 102-104, respectively.
[0045] Note that return communications for server
nodes 102 and 104 do not follow the same path. Return
communications from server node 102 pass down
through IP stack 214, through public interface 220 and
then to client 121. Similarly, return communications from
server node 104 pass down through IP stack 216,
through public interface 222 and then to client 121. This
frees interface node 103 from having to handle return
communication traffic.

[0046] For web server applications (and some other
applications), this return communication mechanism
can provide load balancing for the return traffic. Note
that web servers typically receive navigational commands
from a client, and in response send large volumes
of web page content (such as graphical images)
back to the client. For these applications, it is advantageous
to distribute the return traffic over multiple return
pathways to handle the large volume of return traffic.
[0047] Note that within a server node, such as server
node 104, shared IP addresses are hosted on the "loopback
interface" of server node 104. (The loopback interface
is defined within the UNIX and SOLARIS™ operating
system standards. Solaris is a trademark of Sun
Microsystems, Inc. of Palo Alto, California). Hosting a
shared IP address on a loopback interface has failover
implications. The first interface in the loopback is typically
occupied by the loopback address (for example,
127.0.0.1), which will not fail over. This prevents a problem
in which failing over an IP address that occupies the
physical space of an interface causes configuration data
to be lost for logical adapters associated with other IP
addresses hosted on the same interface.

Data Structures to Support Scalable Services 


[0048] FIG. 3 illustrates data structures associated
with a scalable service in accordance with an embodiment
of the present invention. HA PDT server 230 contains
at least one service group 302. Note that service
group 302 can be associated with a group of services
that share a load balancing policy.
[0049] Also note that service group 302 may have an
associated secondary version on another node for high
availability purposes. Any changes to service group 302
may be checkpointed to this secondary version so that
if the node containing the primary version of service
group 302 fails, the node containing the secondary version
can take over.

[0050] Service group 302 may also be associated with
a number of "slave" versions of the service object located
on other nodes in clustered computing system 100.
This allows the other nodes to access the data within
service group 302. Any changes to service group 302
may be propagated to the corresponding slave versions.
[0051] Service group 302 includes a number of data
structures, including packet distribution table (PDT)
304, load balancing policy 306, service object 308, configuration
node list 310 and instance node list 312.
[0052] Configuration node list 310 contains a list of
server nodes within clustered computing system 100
that can provide the services associated with service
group 302. Instance node list 312 contains a list of the
nodes that are actually being used to provide these services.
Service object 308 contains information related to
one or more services associated with service group 302.
[0053] Load balancing policy 306 contains a descrip- 
tion of a load balancing policy that is used to distribute 
packets between nodes involved in providing services 
associated with service group 302. For example, a pol- 
icy may specify that each node in instance node list 312 
receives traffic from a certain percentage of the source 
addresses of clients that request services associated 
with service group 302. 

[0054] PDT 304 is used to implement the load balancing
policy. PDT 304 includes entries that are populated
with identifiers for nodes that are presently able to re- 
ceive packets for the services associated with service 
group 302. In order to select a server node to forward a 
packet to, the system hashes the source address of the 
client that sent the packet over PDT 304. This hashing 
selects a particular entry in PDT 304, and this entry iden- 
tifies a server node within clustered computing system 
100. 
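
As an illustration of hashing a source address over the PDT, the following minimal sketch maps a client IP address to one table entry. The function names, the use of SHA-256 as the hash, and the node labels are assumptions made for this example; the specification does not name a particular hash function.

```python
import hashlib
import ipaddress
from typing import List


def pdt_index(source_ip: str, table_size: int) -> int:
    """Map a source IP address to a PDT slot (SHA-256 is an assumed choice)."""
    packed = ipaddress.ip_address(source_ip).packed
    digest = hashlib.sha256(packed).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % table_size


def select_server(pdt: List[str], source_ip: str) -> str:
    """Return the server node identifier stored in the selected PDT entry."""
    return pdt[pdt_index(source_ip, len(pdt))]


# Hypothetical PDT whose entries hold server node identifiers.
pdt = ["node102", "node103", "node104", "node102"]
server = select_server(pdt, "192.0.2.7")
```

Because the hash is deterministic, repeated packets from the same source address select the same entry, and therefore the same server node, for as long as the table is unchanged.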

[0055] Note that any random or pseudo-random function
can be used to hash the source address. However,
it is desirable for packets with the same source address 
to map to the same server node in order to support a 
TCP connection (or UDP session) between a client and 
the server node. 

[0056] Also note that the frequency of entries can be 
varied to achieve different distributions of traffic be- 
tween different server nodes. For example, a high per- 
formance server node that is able to process a large 
amount of traffic can be given more entries in PDT 304 
than a slower server node that is able to process less 
traffic. In this way, the high-performance server node will 
on average receive more traffic than the slower server 
node. 
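
One way to vary the frequency of entries is to give each node a number of PDT slots equal to an integer weight. The sketch below is an assumption-laden illustration: the function name `build_pdt`, the integer-weight representation, and the round-robin interleaving are not taken from the specification.

```python
from typing import Dict, List


def build_pdt(weights: Dict[str, int]) -> List[str]:
    """Build a PDT in which each node appears as many times as its weight.

    Entries are interleaved round-robin so that one node's slots are not
    all adjacent (an illustrative choice, not a requirement of the source).
    """
    remaining = dict(weights)
    pdt: List[str] = []
    while any(count > 0 for count in remaining.values()):
        for node in sorted(remaining):
            if remaining[node] > 0:
                pdt.append(node)
                remaining[node] -= 1
    return pdt


# A high-performance node (weight 3) and a slower node (weight 1).
table = build_pdt({"node102": 3, "node104": 1})
```

With a uniform hash over four slots, the node holding three of them would on average receive about three quarters of the client source addresses.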

[0057] Also note that if a PDT server fails with configuration
data present in its local memory, then a secondary
PDT server will take over. A checkpointing process
ensures that the configuration data will also be present
in the local memory for the secondary PDT server. More
specifically, FIG. 9 illustrates how a PDT server is checkpointed
to a slave PDT server and a secondary PDT
server in accordance with an embodiment of the present
invention. As illustrated in FIG. 9, the system maintains
a primary/master PDT server 912 on node 910. For high
availability purposes, the state of primary/master PDT
server 912 is regularly checkpointed to secondary PDT
server 904 on node 902 so that secondary PDT server
904 is kept consistent with primary/master PDT server
912. In this way, if primary/master PDT server 912 fails,
secondary PDT server 904 is able to take its place.
[0058] If primary/master PDT server 912 is not located
on an interface node 906, a slave PDT server 908 is
maintained on interface node 906 for performance reasons
(not high availability reasons). In this case, most
of the state of primary/master PDT server 912 is regularly
checkpointed to slave PDT server 908 in interface
node 906. This allows interface node 906 to access the
information related to packet forwarding locally, within
slave PDT server 908, without having to communicate
with primary/master PDT server 912 on node 910.
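
The checkpointing relationship of FIG. 9 can be sketched roughly as follows. The class names, the dictionary-based state, and the explicit `checkpoint()` call are hypothetical. For simplicity this sketch copies the full state to every replica, whereas the description says only most of the state is sent to the slave PDT server.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PdtServer:
    """A PDT server replica holding service group name -> PDT entries."""
    name: str
    state: Dict[str, List[str]] = field(default_factory=dict)

    def apply_checkpoint(self, snapshot: Dict[str, List[str]]) -> None:
        # Store an independent copy so later primary updates do not leak in.
        self.state = copy.deepcopy(snapshot)


@dataclass
class PrimaryPdtServer(PdtServer):
    """Primary/master server that pushes its state to registered replicas."""
    replicas: List[PdtServer] = field(default_factory=list)

    def update(self, group: str, pdt: List[str]) -> None:
        self.state[group] = list(pdt)

    def checkpoint(self) -> None:
        for replica in self.replicas:
            replica.apply_checkpoint(self.state)


primary = PrimaryPdtServer("primary-912")
secondary = PdtServer("secondary-904")   # takes over if the primary fails
slave = PdtServer("slave-908")           # local read copy on the interface node
primary.replicas = [secondary, slave]
primary.update("group302", ["node102", "node104"])
primary.checkpoint()
```

After `checkpoint()`, the interface node can read `slave.state` locally, and the secondary holds the data it would need to take over.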

Packet Forwarding

[0059] FIG. 4 illustrates how an IP packet 400 is encapsulated
with a DLPI header 402 in accordance with
an embodiment of the present invention. In order for an
IP packet 400 to be forwarded between interface node
103 and server node 104 (see FIG. 2), DLPI header 402
is attached to the head of IP packet 400. Note that DLPI
header 402 includes the medium access control (MAC)
address of one of the interfaces of the destination server
node 104. Also note that IP packet 400 includes a destination
address 404 that specifies an IP address of a
service that is hosted by interface node 103, as well as
the source address 406 for a client that sent the packet.
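
The idea of prepending a link-level header while leaving the IP packet untouched can be sketched as below. This is not the real DLPI header layout: the header here is reduced to a six-byte destination MAC address purely for illustration, and the function names are hypothetical.

```python
HEADER_LEN = 6  # assumed: header reduced to a 6-byte destination MAC address


def mac_to_bytes(mac: str) -> bytes:
    """Convert a colon-separated MAC address string to 6 raw bytes."""
    parts = mac.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 octets in MAC address, got {len(parts)}")
    return bytes(int(part, 16) for part in parts)


def encapsulate(ip_packet: bytes, dest_mac: str) -> bytes:
    """Attach the simplified header to the head of the unmodified IP packet."""
    return mac_to_bytes(dest_mac) + ip_packet


def decapsulate(frame: bytes) -> bytes:
    """Strip the simplified header, recovering the original IP packet."""
    return frame[HEADER_LEN:]


ip_packet = b"\x45\x00" + b"example-payload"   # stand-in bytes, not a real packet
frame = encapsulate(ip_packet, "00:11:22:33:44:55")
```

The point of the sketch is that the destination and source addresses inside the IP packet are not rewritten; only the outer header steers the packet to the chosen server node's interface.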

Configuration Process 

[0060] FIG. 5A is a flow chart illustrating the process 
of service registration in accordance with an embodi- 
ment of the present invention. The system starts by at- 
tempting to configure a scalable service for a particular 
IP address and port number (step 502). The system first 
creates a service group object (step 503), and then cre- 
ates a service object for the scalable service (step 504). 
The system also initializes a configuration node list 310
(see FIG. 3) to indicate which server nodes within clustered
computing system 100 are able to provide the
service (step 506), and sets load balancing policy 306
for the service (step 508). Note that a particular load
balancing policy can specify weights for the particular
server nodes.
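
The registration steps above can be sketched as a small data structure. The field names, the tuple used as a service key, and the default weight of 1 are assumptions for illustration; the specification names the structures (service group, service object, configuration node list, load balancing policy) but not their layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

ServiceKey = Tuple[str, str, int]  # (protocol, IP address, port), assumed form


@dataclass
class ServiceGroup:
    service: ServiceKey                 # stands in for service object 308
    config_nodes: List[str]             # configuration node list 310
    weights: Dict[str, int]             # simplified load balancing policy 306
    instance_nodes: List[str] = field(default_factory=list)  # list 312
    pdt: List[str] = field(default_factory=list)             # PDT 304


def register_service(protocol: str, ip: str, port: int,
                     config_nodes: List[str],
                     weights: Dict[str, int]) -> ServiceGroup:
    """Create the service group, service key, node list and policy weights."""
    unknown = set(weights) - set(config_nodes)
    if unknown:
        raise ValueError(f"weights given for unconfigured nodes: {sorted(unknown)}")
    # Nodes without an explicit weight default to 1 (an assumed default).
    full_weights = {node: weights.get(node, 1) for node in config_nodes}
    return ServiceGroup(service=(protocol, ip, port),
                        config_nodes=list(config_nodes),
                        weights=full_weights)


group = register_service("tcp", "192.0.2.10", 80,
                         ["node102", "node104"], {"node102": 2})
```

Note that registration leaves the instance node list and the PDT empty in this sketch; they are filled in only when instances are activated.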

[0061] FIG. 5B is a flow chart illustrating the process 
of service activation/deactivation in accordance with an 
embodiment of the present invention. This process hap- 
pens whenever an instance is started or stopped, or 
whenever a node fails. For every scalable service, the 
system examines every node on the configuration node 
list 310. If the node matches the running version of the 
scalable service, then the node is added to PDT 304 
and to instance node list 312 (step 510). 
[0062] If at some time in the future a node goes down 
or the service goes down, a corresponding entry is removed
from PDT 304 and instance node list 312 (step
512). 
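
A rough sketch of activation and deactivation follows. The function names, the set of running nodes passed in as a parameter, and the weight-based expansion of PDT entries are assumptions for illustration only.

```python
from typing import Dict, List, Set, Tuple


def activate(config_nodes: List[str], running: Set[str],
             weights: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Examine every configured node; add running ones to the instance
    node list and to the PDT (one PDT entry per unit of weight)."""
    instance_nodes = [node for node in config_nodes if node in running]
    pdt = [node for node in instance_nodes
           for _ in range(weights.get(node, 1))]
    return instance_nodes, pdt


def deactivate(node: str, instance_nodes: List[str],
               pdt: List[str]) -> Tuple[List[str], List[str]]:
    """Remove a failed or stopped node from the instance list and the PDT."""
    return ([n for n in instance_nodes if n != node],
            [n for n in pdt if n != node])


instances, pdt_entries = activate(["node102", "node103", "node104"],
                                  {"node102", "node104"},
                                  {"node102": 2})
instances_after, pdt_after = deactivate("node102", instances, pdt_entries)
```

In this simplified version, removing entries shrinks the table, which changes how source addresses hash onto the remaining entries; a real implementation might instead reassign the vacated slots.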

Packet Processing 

[0063] FIG. 6 is a flow chart illustrating how a packet 
is processed within an interface node in accordance with 
an embodiment of the present invention. The system 
starts by receiving IP packet 400 from client 122 at cluster
networking multiplexer 218 within interface node 103
(step 601). IP packet 400 includes a destination address
404 specifying a service, and a source address 406 of
the client that sent the packet.
[0064] The system first looks up a service for the 
packet based upon destination address 404 (step 602).
This lookup process is described in more detail with ref- 
erence to FIG. 7 below. 

[0065] The system next determines if the service is a
scalable service (step 603). If not, the system sends the
packet to IP stack 215 within interface node/server node
103, so that server instance 202 can provide the non-scalable
service (step 604). Alternatively, interface node
103 can send the packet to a default server node outside
of interface node/server node 103 to provide the non-scalable
service. For example, server node 104 can be
appointed as a default node for non-scalable services.
[0066] If the service is a scalable service, the system 
determines which server node to send the packet to. In 
doing so, the system first determines whether the packet
is subject to client affinity (step 605). If so, the system 
hashes the source IP address over PDT 304 to select 
an entry from PDT 304 (step 606). If not, the system 
hashes the source IP address and the port number over 
PDT table 304 (step 607). 
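
The choice between steps 606 and 607 can be illustrated with a short sketch. The function name `select_pdt_index`, the string form of the key, and the use of CRC-32 as the hash are assumptions made for the example; the text does not name a particular hash function.

```python
import zlib


def select_pdt_index(src_ip, src_port, pdt_size, client_affinity):
    """Return the index of the PDT entry chosen for a packet."""
    if client_affinity:
        key = src_ip                   # step 606: source IP address only
    else:
        key = f"{src_ip}:{src_port}"   # step 607: source IP address and port number
    # CRC-32 is an assumed hash; any deterministic hash would serve the sketch.
    return zlib.crc32(key.encode("ascii")) % pdt_size
```

With client affinity, every packet from one client address lands on the same entry regardless of its port, which is what keeps related connections on one server node. Without it, connections from the same client can spread across entries.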

[0067] Next, the system determines whether the protocol
is TCP (step 608). If the protocol is not TCP (meaning
it is UDP), the system retrieves an identifier for a server
node from the entry (step 611). Otherwise, if the protocol
is TCP, the system determines whether the current IP
number and address are in a forwarding list (step 609).
If so, the system retrieves the server identifier from the
forwarding list (step 610). Otherwise, the system
retrieves the server identifier from the selected entry in
PDT 304 (step 611).
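
Steps 608 through 611 amount to a small decision procedure, sketched below. The function name `choose_server` and the representation of the forwarding list as a dictionary keyed by (source IP address, source port) are assumptions for illustration; the text does not specify how the forwarding list is keyed or stored.

```python
def choose_server(protocol, src_ip, src_port, pdt, pdt_index, forwarding_list):
    """Pick a server node identifier for a packet of a scalable service."""
    if protocol == "TCP":
        # Step 609: a connection found in the forwarding list keeps its server.
        server = forwarding_list.get((src_ip, src_port))
        if server is not None:
            return server          # step 610
    # UDP packets, and TCP packets not in the forwarding list, use the PDT entry.
    return pdt[pdt_index]          # step 611
```

Consulting the forwarding list first lets an established TCP connection stay with its original server node even after the PDT has changed.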

[0068] Next, the system forwards the packet to the 
server node (step 612). This forwarding process is de- 
scribed in more detail below with reference to FIG. 8. 
[0069] Interface node 103 then allows the selected 
server node to send return communications directly 
back to the client (step 614). 

Process of Looking up a Service 

[0070] FIG. 7 is a flow chart illustrating the process of
looking up a service for a packet in accordance with an
embodiment of the present invention. The system starts
by performing a lookup based upon the destination
address in a first hash table (step 702). This lookup
involves using the protocol, IP address and port number
of the service. If an entry is returned during this lookup,
the process is complete and a scalable service is
returned.

[0071] Otherwise, the system looks up a scalable
service in a second hash table based upon the
destination address (step 706). In this case, only the protocol
and the IP address are used to perform the lookup. This
is because the second lookup involves a scalable
service with a "client affinity" property. This client affinity
property attempts to ensure that related services are
performed on the same server node for the same client.
Hence, the second hash table associates related
services with the same IP address but with different port
numbers with the same server node.
[0072] If no entry is returned in the second lookup,
then the service is a non-scalable service and the
system signals this fact (step 710). Otherwise, if an entry is
returned in the second lookup, the process is complete
and a scalable service of the second type is returned.
[0073] In one embodiment of the present invention,
the first lookup selects services to be associated with
one load balancing policy and the second lookup selects
services to be associated with a second, different load
balancing policy.
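
The two-stage lookup of FIG. 7 can be sketched with two dictionaries standing in for the hash tables. The function name `lookup_service`, the tuple keys, and the returned labels are assumptions introduced for the example.

```python
def lookup_service(protocol, dst_ip, dst_port, first_table, second_table):
    """Return (service, kind) for a destination address, following FIG. 7."""
    # Step 702: first table is keyed by protocol, IP address and port number.
    service = first_table.get((protocol, dst_ip, dst_port))
    if service is not None:
        return service, "scalable"
    # Step 706: second table is keyed by protocol and IP address only.
    service = second_table.get((protocol, dst_ip))
    if service is not None:
        return service, "scalable-client-affinity"
    # Step 710: neither lookup returned an entry.
    return None, "non-scalable"
```

Because the second key omits the port number, every port on a client-affinity address resolves to the same service entry, which matches the purpose given for the second table.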

Process of Forwarding a Packet

[0074] FIG. 8 is a flow chart illustrating the process of
forwarding a packet to a server in accordance with an
embodiment of the present invention. At some time
during an initialization process, the system ensures that the
IP address of a service is hosted on the loopback
interface of each server node that will be used to perform the
service (step 801). This allows each server node to
process packets for the service, in spite of the fact that the
service is not hosted on a public interface of the server
node. After an IP packet 400 is received and after a
service and a server node are selected (in step 612 of FIG.
6), the system forwards IP packet 400 from cluster
networking multiplexer 218 in interface node 103 to IP stack
216 within server node 104. This involves constructing
a DLPI header 402, including the MAC address of server
node 104 (step 802), and then attaching DLPI header
402 to IP packet 400 (see FIG. 4) (step 804).
[0075] Next, the system sends the IP packet 400 with
DLPI header 402 to private interface 224 within interface
node 103 (step 806). Private interface 224 sends IP
packet 400 with DLPI header 402 to server node 104.
Server node 104 receives the IP packet 400 with DLPI
header 402 at private interface 226 (step 808). Next, a
driver within server node 104 strips DLPI header 402
from IP packet 400 (step 810). IP packet 400 is then fed
into the bottom of IP stack 216 on server node 104 (step
812). IP packet 400 subsequently passes through IP
stack 216 on its way to server instance 203.
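
The attach and strip operations of steps 802, 804 and 810 can be sketched as plain byte manipulation. The layout of DLPI header 402 is not given in the text, so the example substitutes an Ethernet-style 14-byte header (destination MAC, source MAC, type field) purely as an assumption; the function names are likewise hypothetical.

```python
import struct

ETHERTYPE_IP = 0x0800              # EtherType value for IPv4
HEADER = struct.Struct("!6s6sH")   # assumed layout: dst MAC, src MAC, type


def attach_header(ip_packet, dst_mac, src_mac):
    """Steps 802-804: prepend a link-level header naming the server node."""
    return HEADER.pack(dst_mac, src_mac, ETHERTYPE_IP) + ip_packet


def strip_header(frame):
    """Step 810: drop the link-level header, leaving the original IP packet."""
    return frame[HEADER.size:]
```

The IP packet itself is never rewritten: its destination address still names the service, which is why the service address must be hosted on the loopback interface of the server node.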

[0076] Note that the conventional means of using a
remote procedure call (RPC) or an interface definition
language (IDL) call to forward a packet from interface
node 103 to server node 104 involves traversing an IP
stack from an RPC/IDL endpoint to private interface 224
within interface node 103, and then traversing another
IP stack again at server node 104 from private interface
226 to an RPC/IDL endpoint. This involves two IP stack
traversals, and is hence highly inefficient.
[0077] In contrast, the technique outlined in the
flowchart of FIG. 8 eliminates the two IP stack traversals.
[0078] Also note that, in forwarding the packet to the
server node, the system can load balance between
multiple redundant paths between the interface node and
the server node by using a distribution mechanism such
as a PDT.

[0079] The foregoing descriptions of embodiments of
the invention have been presented for purposes of
illustration and description only. They are not intended to be
exhaustive or to limit the invention to the forms
disclosed. Accordingly, many modifications and variations
will be apparent to practitioners skilled in the art.
Additionally, the above disclosure is not intended to limit the
invention. The scope of the invention is defined by the
appended claims.



Claims 

1. A method for performing a fast lookup to determine
a service being provided by server nodes 
(102,103,104) within a cluster of nodes (100), the 
method comprising: 

receiving (601) a packet (400) at an interface 
node (103) in the cluster of nodes, the packet 
including a source address (406) specifying a 
location of a client that the packet originated 
from, and a destination address (404) specify- 
ing a service provided by the cluster of nodes; 
using (602) the destination address to perform 
(702) a first lookup into a first lookup structure 
containing identifiers for scalable services; and 
if an identifier for a scalable service is returned
for the packet,
looking up a server node in the cluster of nodes
to send the packet to based upon the source
address of the packet, and
sending (604) the packet to the server node.

2. The method of claim 1, further comprising, if no 
identifier for a scalable service is returned for the 
packet (400), sending the packet to a server in- 
stance (202) in the cluster of nodes that provides a 
non-scalable service. 

3. The method of claim 2, wherein looking up the serv- 
er node comprises mapping the source address 
(406) to an entry in a packet distribution table (304), 
the packet distribution table including entries con- 
taining identifiers for server nodes in the cluster of 
nodes; and 

wherein mapping the source address includes us- 
ing (606,607) a hash function that maps different 
source addresses to different entries in the packet 
distribution table in a substantially random manner, 
and wherein the hash function always maps a given 
source address to the same entry in the packet dis- 
tribution table. 

4. The method of claim 1, further comprising allowing
(614) the server node to send return communications
directly to the client without forwarding the
return communications through the interface node.

5. The method of claim 1, wherein the first lookup
structure is a hash table containing the identifiers
for the scalable services.

6. The method of claim 1, further comprising, if the first
lookup does not return an identifier for a scalable
service, using the destination address to perform
(706) a second lookup into a second lookup
structure containing identifiers for scalable services.

7. The method of claim 6, wherein the first lookup
structure includes identifiers for scalable services
that use a first load balancing policy (306) to
distribute packets between server nodes, and wherein the
second lookup structure includes identifiers for
scalable services that use a second load balancing
policy (306).

8. The method of claim 1, wherein a scalable service
is provided by multiple server nodes in the cluster
of nodes so that the scalable service provides more
server node capacity for the scalable service as
demand for the scalable service increases.

9. The method of claim 8, wherein the second load
balancing policy keeps related services for a given
source address (406) on the same server node.

10. The method of claim 1, further comprising, if no
scalable service is returned for the packet, sending
(604) the packet to a server instance (202) on the
interface node (103).

11. The method of claim 1, further comprising:

periodically sending checkpointing information
from a primary packet distribution table (PDT)
server (912) to a secondary PDT server (904)
so that the secondary PDT server is kept in a
consistent state with the primary PDT server;
and
if the primary PDT server fails, allowing the
secondary PDT server to take over for the primary
PDT server.

12. The method of claim 1, further comprising
periodically sending checkpointing information from a
master packet distribution table (PDT) server (912)
to a slave PDT server (908) located on the interface
node (103).

13. A computer program which when running on a
computer or computer network is capable of performing
the steps of any one of method claims 1 to 12.

14. The computer program of claim 13, embodied on a
computer-readable storage medium.

15. An apparatus that performs a fast lookup to deter- 
mine a service being provided by server nodes 
(102,103,104) within a cluster of nodes (100), com- 
prising: 

a receiving mechanism (601) that is configured
to receive a packet (400) at an interface node
(103) in the cluster of nodes, the packet
including a source address (406) specifying a location
of a client that the packet originated from, and
a destination address (404) specifying a
service provided by the cluster of nodes;
a first lookup mechanism (702) that is config- 
ured to use (602) the destination address to 
perform a first lookup into a first lookup struc- 
ture containing identifiers for scalable services, 
a scalable service being provided by multiple 
server nodes in the cluster of nodes so that the 
scalable service provides more server node ca- 
pacity for the scalable service as demand for 
the scalable service increases; and 
a server node identification mechanism, where- 
in if an identifier for a scalable service is re- 
turned for the packet, the server node identifi- 
cation mechanism is configured to look up a 
server node in the cluster of nodes to send the 
packet to based upon the source address of the 
packet; and 

a sending mechanism (604) that is configured 
to send the packet to the server node. 



16. The apparatus of claim 15, wherein the sending
mechanism (604) is configured to send the packet
to a server node (103) in the cluster of nodes that
provides a non-scalable service if no identifier for a
scalable service is returned for the packet.

17. The apparatus of claim 16, wherein the server node
identification mechanism is configured to map the
source address (406) to an entry in a packet
distribution table (304), the packet distribution table
including entries containing identifiers for server
nodes in the cluster of nodes; and

wherein mapping the source address includes
using (606,607) a hash function that maps different
source addresses to different entries in the packet
distribution table in a substantially random manner,
and wherein the hash function always maps a given
source address to the same entry in the packet
distribution table.

18. The apparatus of claim 15, wherein the first lookup
structure is a hash table containing the identifiers
for the scalable services.

19. The apparatus of claim 15, further comprising a
second lookup mechanism (706) wherein if the first
lookup mechanism (702) does not return an
identifier for a scalable service, the second lookup
mechanism is configured to use the destination address
to perform a second lookup into a second lookup
structure containing identifiers for scalable
services.

20. The apparatus of claim 19, further comprising:

a checkpointing mechanism that is configured
to periodically send checkpointing information
from a primary packet distribution table (PDT)
server (912) to a secondary PDT server (904)
so that the secondary PDT server is kept in a
consistent state with the primary PDT server;
and

a failover mechanism that is configured to allow
the secondary PDT server to take over for the
primary PDT server if the primary PDT server
fails.

21. The apparatus of claim 20, wherein the
checkpointing mechanism is additionally configured to
periodically send checkpointing information from the
primary PDT server (912) to a slave PDT server (908)
located on the interface node.



EP 1 117 222 A1

[FIG. 1: clustered computing system 100, with storage system 110]



[Drawing sheet between FIG. 1 and FIG. 3; the figure label and most of the rotated text were not recovered. Legible labels: HA PDT server 230, server node 102, private interface 228, IP stack 214, public interface 220]



[FIG. 3: service group 302, containing PDT 304 (entries shown: node 1, node 1, node 2, node 1, node 3), load balancing policy 306, service object 308, configuration node list 310, and instance node list 312; HA PDT server 230, with links labeled "to HA secondary" and "to slave"]

[FIG. 4: IP packet 400, with destination address 404, source address 406, and body 408]



[FIGS. 5A and 5B: two flow charts.
First flow chart: start; configure scalable service (502); create service group object (503); create service object (504); initialize configuration node list (506); set load balancing policy (weights) (508); end.
Second flow chart: start; for every service on every node in the configuration node list, if the node matches a service then (1) add to PDT table and (2) add to instance node list (510); if node goes down or service goes down, (1) remove from PDT table and (2) remove from instance list (512); end.]



[FIG. 6: flow chart. Recovered boxes: start (600); receive packet at interface node from client (601); lookup service for packet based on destination address of packet (602); send packet to local IP stack (604); hash source IP address over PDT table (606); hash source IP address and port# over PDT table (607); retrieve server ID from forwarding list (611); retrieve identifier for server node from PDT table (612); forward packet to server node (613); allow server to return communications directly to client (614); end (620). The decision boxes and their yes/no branches were not recovered.]



[FIG. 7: flow chart. Start (700); lookup scalable service in first hash table based on protocol, IP address and port number of destination (702); lookup scalable service in second hash table based on protocol and IP address (706); signal non-scalable service (710).]

[FIG. 8: flow chart. Start; ensure IP address of service is hosted on loopback interface of server node (801); construct DLPI header with MAC address of server node (802); attach DLPI header to IP packet (804); send packet to private interface (806); receive packet on private interface at server (808); strip DLPI header from packet (810); feed packet into IP stack on server (812).]



[FIG. 9: node 902 with secondary PDT server 904; interface node 906 with slave PDT server 908; node 910 with primary/master PDT server 912; two links labeled "checkpointing"]



European Patent Office

EUROPEAN SEARCH REPORT

Application Number: EP 00 20 4324

DOCUMENTS CONSIDERED TO BE RELEVANT

Citation of document with indication, where appropriate, of relevant passages:

EP 0 865 180 A (LUCENT TECHNOLOGIES INC)
16 September 1998 (1998-09-16)
* column 5, line 20-49 *
* column 9, line 5-50 *
* column 12, line 32-38 *
Category: X, Y
Relevant to claim: 1,4,5,13-15,18 (X); 9,11,12 (Y)

HUNT G D H ET AL: "Network Dispatcher: a
connection router for scalable Internet
services"
COMPUTER NETWORKS AND ISDN SYSTEMS, NORTH
HOLLAND PUBLISHING. AMSTERDAM, NL,
vol. 30, no. 1-7,
1 April 1998 (1998-04-01), pages 347-357,
XP004121412
ISSN: 0169-7552
* paragraph [0005] - paragraph [06.2] *
Relevant to claim: 9,11,12

WO 98 26559 A (GTE INTERNETWORKING INC)
18 June 1998 (1998-06-18)
* page 17 - page 21 *
Relevant to claim: 1-4,6-8,13-16,19

CLASSIFICATION OF THE APPLICATION (Int.Cl.7): H04L29/06

TECHNICAL FIELDS SEARCHED (Int.Cl.7): H04L

The present search report has been drawn up for all claims

Place of search: THE HAGUE
Date of completion of the search: 19 April 2001
Examiner: Dupuis, H

CATEGORY OF CITED DOCUMENTS

X : particularly relevant if taken alone
Y : particularly relevant if combined with another
    document of the same category
A : technological background
O : non-written disclosure
P : intermediate document

T : theory or principle underlying the invention
E : earlier patent document, but published on, or
    after the filing date
D : document cited in the application
L : document cited for other reasons

& : member of the same patent family, corresponding
    document



ANNEX TO THE EUROPEAN SEARCH REPORT
ON EUROPEAN PATENT APPLICATION NO. EP 00 20 4324

This annex lists the patent family members relating to the patent documents cited in the above-mentioned European search report.
The members are as contained in the European Patent Office EDP file on 19-04-2001.
The European Patent Office is in no way liable for these particulars which are merely given for the purpose of information.

Patent document cited   Publication   Patent family   Publication
in search report        date          member(s)       date

EP 0865180              16-09-1998    CA 2230550 A    14-09-1998

WO 9826559              18-06-1998    US 6185619 B    06-02-2001
                                      AU 724096 B     14-09-2000
                                      AU 5692498 A    03-07-1998
                                      EP 1016253 A    05-07-2000
                                      US 6195399 B    27-02-2001

For more details about this annex: see Official Journal of the European Patent Office, No. 12/82



